Tidyverse

 

Rodney Dyer, PhD

Topic Guide

  • Action verbs
  • Workflows
  • Strategies

R Data Workflow

Describe the daytime air temperatures at the Rice Rivers Center for February, 2014 by day of the week.

To do this, we have the following general workflow.

Action Verbs

Data Operators

There are a finite number of action verbs that can be used on raw data. They are combined to yield meaningful inferences from our data.

  1. Is there more sun on Fridays?
  2. What is the distribution of high-tide depths for each day in January?
  3. Is there a relationship between salinity & pH?

Select

Identify only subset of data columns that you are interested in using.

Filter

Use only some subset of rows in the data based upon qualities within the columns themselves.

Arrange

Reorder the data using values in one or more columns to sort.

Mutate

Convert one data type to another, scaling, combining, or making any other derivative component.

Summarize

Perform operations on the data to characterize trends in the raw data as summary statistics.

Group

Partition the data set into groups based upon some taxonomy of categorization.

Workflows

Combinations Yield Inferences

The manner in which we organize these action verbs yields an infinite number of combinations.

 

 

The Treachery of Images

The Pipe Operator

In R we use this grammar.

data %>% Y()

To take the values in data and pass them as if you entered the data as the first argument to the function Y().

These pipes can be chained together into a single operation.

data %>%
  func1() %>%
  func2() %>%
  func3() -> newData

Examples

The Data

The data we will be working with consist of data from the Rice Rivers Center.

The Raw Data

The raw data are retained in a single Google Spreadsheet that is published as a CSV file and can be found here.

The Data in R

So, let us load it in here.

url <- "https://docs.google.com/spreadsheets/d/1Mk1YGH9LqjF7drJE-td1G_JkdADOU0eMlrP01WFBT8s/pub?gid=0&single=true&output=csv"
rice <- read_csv( url )
names( rice )
 [1] "DateTime"                       "RecordID"                      
 [3] "PAR"                            "WindSpeed_mph"                 
 [5] "WindDir"                        "AirTempF"                      
 [7] "RelHumidity"                    "BP_HG"                         
 [9] "Rain_in"                        "H2O_TempC"                     
[11] "SpCond_mScm"                    "Salinity_ppt"                  
[13] "PH"                             "PH_mv"                         
[15] "Turbidity_ntu"                  "Chla_ugl"                      
[17] "BGAPC_CML"                      "BGAPC_rfu"                     
[19] "ODO_sat"                        "ODO_mgl"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

Column Selection

Using the column numbers instead of names.

df <- rice[ c(1,3,5,13)]
summary( df )
   DateTime              PAR              WindDir             PH      
 Length:8199        Min.   :   0.000   Min.   :  0.00   Min.   :6.43  
 Class :character   1st Qu.:   0.000   1st Qu.: 37.31   1st Qu.:7.50  
 Mode  :character   Median :   0.046   Median :137.30   Median :7.58  
                    Mean   : 241.984   Mean   :146.20   Mean   :7.60  
                    3rd Qu.: 337.900   3rd Qu.:249.95   3rd Qu.:7.69  
                    Max.   :1957.000   Max.   :360.00   Max.   :9.00  
                                                        NA's   :1     

Named Column Selection

  • Column names are probably better than column numbers

  • Additional assistance from RStudio via pop-ups.

  • Longer term readability (like next Tuesday & Beyond!)

Figure 1: Popup help for column names in RStudio for a data frame in memory

Data Pliars

The dplyr library defines a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges using select, filter, arrange, mutate, group_by, and summarise functionality.

It is part of tidyverse

Select Tidy

In tidyverse we can use the column names and do not need to be quoted.

rice %>%
  select( DateTime ) -> df
head( df )
# A tibble: 6 × 1
  DateTime            
  <chr>               
1 1/1/2014 12:00:00 AM
2 1/1/2014 12:15:00 AM
3 1/1/2014 12:30:00 AM
4 1/1/2014 12:45:00 AM
5 1/1/2014 1:00:00 AM 
6 1/1/2014 1:15:00 AM 

Selecting Several

The select() function allows you to choose many columns of data to work with.

rice %>%
  select( DateTime, AirTempF ) -> df 
head(df)
# A tibble: 6 × 2
  DateTime             AirTempF
  <chr>                   <dbl>
1 1/1/2014 12:00:00 AM     31.0
2 1/1/2014 12:15:00 AM     30.7
3 1/1/2014 12:30:00 AM     31.2
4 1/1/2014 12:45:00 AM     30.5
5 1/1/2014 1:00:00 AM      30.9
6 1/1/2014 1:15:00 AM      30.6

Selecting to Rearrange

There are times when we want to

names( rice )
 [1] "DateTime"                       "RecordID"                      
 [3] "PAR"                            "WindSpeed_mph"                 
 [5] "WindDir"                        "AirTempF"                      
 [7] "RelHumidity"                    "BP_HG"                         
 [9] "Rain_in"                        "H2O_TempC"                     
[11] "SpCond_mScm"                    "Salinity_ppt"                  
[13] "PH"                             "PH_mv"                         
[15] "Turbidity_ntu"                  "Chla_ugl"                      
[17] "BGAPC_CML"                      "BGAPC_rfu"                     
[19] "ODO_sat"                        "ODO_mgl"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

Selecting to Rearrange

Love the everything() function.

rice %>%
  select( RecordID, ODO_mgl, PH, everything() ) -> df
names( df )
 [1] "RecordID"                       "ODO_mgl"                       
 [3] "PH"                             "DateTime"                      
 [5] "PAR"                            "WindSpeed_mph"                 
 [7] "WindDir"                        "AirTempF"                      
 [9] "RelHumidity"                    "BP_HG"                         
[11] "Rain_in"                        "H2O_TempC"                     
[13] "SpCond_mScm"                    "Salinity_ppt"                  
[15] "PH_mv"                          "Turbidity_ntu"                 
[17] "Chla_ugl"                       "BGAPC_CML"                     
[19] "BGAPC_rfu"                      "ODO_sat"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

Filter

The function filter() works to select records (rows) based upon some criteria.

range( rice$AirTempF )
[1]  3.749 74.870

 

rice %>%
  filter( AirTempF < 32 ) -> df
range( df$AirTempF )
[1]  3.749 31.990

Combinations of Filters

nrow( rice )
[1] 8199

 

rice %>%
  filter( AirTempF > 32, 
          !is.na(PH), 
          Turbidity_ntu < 15) -> df
nrow(df)
[1] 1449

Arrange

We can sort entire data.frame objects based upon the values in one or more of the columns using the arrange() function.

rice$WindSpeed_mph[1]
[1] 3.87

 

rice %>%
  arrange( WindSpeed_mph ) -> df 
df$WindSpeed_mph[1]
[1] 0

Reverse Sorting Order

To reverse the order, use the negative operator on the column name object in the function.

rice %>%
  arrange( -WindSpeed_mph ) -> df 
df$WindSpeed_mph[1]
[1] 30.65

Cominations of Sorting

You can also sort using several criteria.

rice %>%
  arrange( -WindSpeed_mph, WindDir ) -> df

Mutations

The mutate() function creates new columns of data.

class( rice$DateTime )
[1] "character"

Mutations

library( lubridate )
rice %>%
  mutate( Date = mdy_hms(DateTime, tz = "EST") ) -> df
class( df$Date )
[1] "POSIXct" "POSIXt" 
summary( df$Date )
                 Min.               1st Qu.                Median 
"2014-01-01 00:00:00" "2014-01-22 08:22:30" "2014-02-12 16:45:00" 
                 Mean               3rd Qu.                  Max. 
"2014-02-12 16:45:00" "2014-03-06 01:07:30" "2014-03-27 09:30:00" 

Simultaneous Mutations

You can make several mutations in one call or you can pipe several mutation events at one time.

rice %>%
  mutate( Date = mdy_hms(DateTime, tz = "EST"), 
          Month = month(Date, label = TRUE) ) -> df
summary( df$Month )
 Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec 
2976 2688 2535    0    0    0    0    0    0    0    0    0 

Summarize

The summarize() function derives inferences from the current data.frame and produces a new one.

rice %>%
  summarize( `Air Temp` = mean( AirTempF), 
             `Water Temp` = mean(H2O_TempC, na.rm=TRUE))
# A tibble: 1 × 2
  `Air Temp` `Water Temp`
       <dbl>        <dbl>
1       38.8         5.53

Group & Summarize

The group_by() function allows you to arbitrarily pull together subset of data and prepare them to be worked on by something like summary().

rice %>%
  mutate( Date = mdy_hms( DateTime, 
                          tz="EST"),
          Month = month( Date, 
                         abbr = FALSE, 
                         label=TRUE) ) %>%
  group_by( Month ) %>%
  summarize( `Air Temp` = mean( AirTempF), 
             `Water Temp` = mean( H2O_TempC, 
                                  na.rm=TRUE) ) 
# A tibble: 3 × 3
  Month    `Air Temp` `Water Temp`
  <ord>         <dbl>        <dbl>
1 January        34.7         3.68
2 February       39.7         5.29
3 March          42.6         7.96

Stratgies

Flows

Here are some strategies to consider.

  1. Do not think that you have to do the whole thing at once. I typically build up the workflow, one line at a time. Make sure the output from the previous line is what you think it should be then add the next one.
  2. Keep your code open and airy, it makes it easier to read and to catch any logical errors that may arise.
  3. You can pipe into a lot of different functions. In fact, any function that takes a data frame can be the recipient of a pipe. While developing a workflow, I will often pipe into things like head(), summary(), or View() to take a look at what is coming out of my workflow to make sure it resembles what I think it should look like.

Quoted Names

There are times when working with data that we may want to use reasonable names.

names( rice )
 [1] "DateTime"                       "RecordID"                      
 [3] "PAR"                            "WindSpeed_mph"                 
 [5] "WindDir"                        "AirTempF"                      
 [7] "RelHumidity"                    "BP_HG"                         
 [9] "Rain_in"                        "H2O_TempC"                     
[11] "SpCond_mScm"                    "Salinity_ppt"                  
[13] "PH"                             "PH_mv"                         
[15] "Turbidity_ntu"                  "Chla_ugl"                      
[17] "BGAPC_CML"                      "BGAPC_rfu"                     
[19] "ODO_sat"                        "ODO_mgl"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

Quoted Names

If the names are set reasonably in the workflow, then they will be piped directly into tables and figures correctly. Here is a fully formed output from R.

rice %>%
  mutate( Date = mdy_hms( DateTime, 
                          tz="EST"),
          Month = month( Date, 
                         abbr = FALSE, 
                         label=TRUE),
          AirTemp = (AirTempF-32) * 5/9 ) %>%
  group_by( Month ) %>%
  summarize( `Ambient Air` = mean( AirTemp), 
             `Water Surface` = mean( H2O_TempC, 
                                  na.rm=TRUE) ) -> df 

Tabular Output

So far, we’ve been showing the output as raw R output in the console or in Markdown. However, let’s take it up a notch and get a bit more formal and use a package that will make professional looking tables.

Here is the official GitHub repository for the project.

Using kable

Simply pipe the data into the kable() function.

library( knitr )

df %>%
  kable()
Month Ambient Air Water Surface
January 1.498863 3.676346
February 4.302240 5.289107
March 5.888374 7.959093

Options to kable

df %>%
  kable( digits = 2, 
         caption = "Air and water temperatures measured at the Rice Rivers Center.") -> kbl

kbl
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

For more options see ?kable

Making Fancier Tables

There is a second package, kableExtra, that adds a lot more functionality to the display of tabular output.

Here is a great overview of the various customization that is available.

Basic Style

library( kableExtra )

kbl %>% 
  kable_styling()
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Paper Style

kbl %>% 
  kable_paper()
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Classic Style

kbl %>% 
  kable_classic( full_width = FALSE )
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Fancy Material Styles

kbl %>% 
  kable_material( c("striped", "hover"), 
                  full_width = FALSE )
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Quoted Names & More Fancy Headings

kbl %>% 
  kable_paper( full_width=FALSE) %>%
  add_header_above( c(" "=1, "Temperature (°C)" = 2))
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Questions

If you have any questions, please feel free to post to the Canvas discussion board for the class, drop me an email, or show up for Office Hours.

Peter Sellers looking bored